Abstract

We propose a new clustering method based on the local minimization of the
$\gamma$-divergence, which we call spontaneous clustering. The greatest advantage of the
proposed method is that it automatically detects the number of clusters that adequately
reflects the data structure. In contrast, existing methods such as $K$-means, fuzzy c-means,
and model-based clustering require the number of clusters to be prescribed. We detect all
the local minimum points of the $\gamma$-divergence, which are defined as the centers of
clusters. A necessary and sufficient condition for the $\gamma$-divergence to have local
minimum points is also derived in a simple setting. A simulation study and a real data
analysis are performed to compare our proposal with existing methods.



1 Introduction 



Cluster analysis is a common procedure for grouping similar objects in unsupervised
learning (Jain et al., 1999; Xu and Wunsch, 2005; Hastie et al., 2009). The procedure
stably produces a classification and is frequently used as a preprocessing step before
supervised learning. Cluster analysis has wide applications over many disciplines in
exploratory data analysis. See, for example, Jin et al. (2011) and Wu et al. (2011) for
recent developments. There are mainly two approaches in cluster analysis. One is the
hierarchical approach, which describes a tree structure called a dendrogram. The other is
the approach of data space partition, such as the $K$-means algorithm. This paper focuses on
the latter approach from the viewpoint of statistical pattern recognition.



We propose what we call spontaneous clustering. It starts by finding the centers
of clusters in a data set. For this purpose, we employ a loss function derived
from the power entropy with power index $\gamma$. It is referred to as the $\gamma$-loss function
(Fujisawa and Eguchi, 2008; Eguchi and Kato, 2010). Here is a motivating example
for the proposal of spontaneous clustering. Consider the problem of estimating the
Gaussian mean parameter $\mu$. The maximum likelihood estimator (MLE) of $\mu$ is given
by the arithmetic mean of the data set, as the unique maximum point of the log likelihood
function. It is known that the MLE behaves poorly in various situations where the
Gaussianity assumption is inappropriate. For example, the log likelihood function gives
a rather misleading summary, as seen in panel (a) of Figure 1. In contrast, the $\gamma$-loss
function properly reflects the shape of the data. For the same data set as in panel (a) of
Figure 1, panel (b) shows that the $\gamma$-loss function has two local minimum points
corresponding to the two normal distributions. We propose to determine the centers of
clusters by such local minimum points.

Almost all procedures based on data space partition need the number of clusters a priori.
The selection of the number of clusters is a major challenge in cluster analysis. Many
methods have been proposed in the literature (Xu and Wunsch, 2005). Our clustering
method can find the number of clusters automatically as long as the value of $\gamma$ is
properly fixed. The name spontaneous clustering comes from this property. Instead
of the number of clusters, the value of the power index $\gamma$ has to be determined. We
propose two methods to accomplish this aim. One is a heuristic choice of $\gamma$ that merely
relies on the range of the data, and the other is a more sophisticated method based on the
Akaike Information Criterion (AIC).

This paper is organized as follows. Section 2 describes the algorithm of spontaneous
clustering and the selection procedure for the value of $\gamma$. In section 3 the existence
of the local minimum points is discussed. Section 4 investigates the numerical properties
of spontaneous clustering. In section 5 a real data analysis is given. Finally, a
discussion is presented in section 6.

2 Spontaneous Clustering 

We begin with a statistical formulation of cluster analysis. Suppose the $p$-dimensional
density function of the population distribution is given by

$$ g(x) = \sum_{k=1}^{K} \tau_k f_k(x), \qquad \sum_{k=1}^{K} \tau_k = 1, \quad \tau_k > 0, \quad k = 1, \dots, K, \tag{1} $$

where $f_k(x)$ is a density function. Let $\{x_1, \dots, x_n\}$ be a data set generated from $g$.
We apply the $\gamma$-estimation method to this data set. The $\gamma$-loss function for the normal
distribution with the identity covariance matrix is given by

$$ L_\gamma(\mu) = -\frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\gamma}{2} \|x_i - \mu\|^2 \right), \tag{2} $$

apart from a constant, where $\mu$ and $\|\cdot\|$ denote the mean vector and the Euclidean
norm, respectively. In the remainder of the paper, we omit constant terms that do not
affect the optimization. Panel (b) of Figure 1 illustrates $L_\gamma(\mu)$. See appendix B
for a general introduction to the $\gamma$-loss function. It is expected that the $\gamma$-loss function
$L_\gamma(\mu)$ has $K$ local minimum points corresponding to the $K$ mean vectors of
$f_1, \dots, f_K$. We then expect that the local minimum points can help us to define
the centers of $K$ clusters and to build $K$ clusters in a similar way to the $K$-means
algorithm. The covariance structure of the data set is taken into consideration in a
subsequent discussion.
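As a minimal sketch of this idea, the following Python code evaluates the identity-covariance $\gamma$-loss, $-\frac{1}{n}\sum_i \exp(-\frac{\gamma}{2}\|x_i - \mu\|^2)$, on a grid and counts its local minima. The toy data, the grid, the seed, and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gamma_loss(mu, x, gamma):
    """Gamma-loss for the normal model with identity covariance (constants dropped).

    mu: (p,) candidate center, x: (n, p) data, gamma: positive power index.
    """
    sq_dist = np.sum((x - mu) ** 2, axis=1)
    return -np.mean(np.exp(-0.5 * gamma * sq_dist))

def count_grid_local_minima(values):
    """Count strict interior local minima of a 1-D array of loss values."""
    v = np.asarray(values)
    return int(np.sum((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

# Hypothetical toy data: two well-separated univariate Gaussian groups.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(10.0, 1.0, 100)])[:, None]
grid = np.linspace(-3.0, 13.0, 321)
loss = np.array([gamma_loss(np.array([m]), x, gamma=1.0) for m in grid])
n_minima = count_grid_local_minima(loss)
```

A grid search like this is only practical in one or two dimensions; it is meant to visualize the loss, not to replace the iterative algorithm of subsection 2.1.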



2.1 $\gamma$-loss Function for the Normal Distribution

We consider the $\gamma$-loss function for the normal distribution with mean vector $\mu$ and
covariance matrix $\Sigma$,

$$ L_\gamma(\mu, \Sigma) = -\det(\Sigma)^{-\frac{\gamma}{2(1+\gamma)}} \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\gamma}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right). $$

An iterative algorithm to find the local minimum points of $L_\gamma(\mu, \Sigma)$ is proposed in
Fujisawa and Eguchi (2008) and Eguchi and Kato (2010). It is obtained by differentiating
$L_\gamma(\mu, \Sigma)$ with respect to $\mu$ and $\Sigma^{-1}$ and setting the derivatives to 0. The algorithm is
a concave-convex procedure (CCCP) (Yuille and Rangarajan, 2003), so it is guaranteed
to decrease the $\gamma$-loss function monotonically as the iteration step $t$ increases. It
is described as follows.

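Step 1 Set appropriate $\mu_0$ and $\Sigma_0$ as initial values.

Step 2 Given $\mu_t$ and $\Sigma_t$, calculate $\mu_{t+1}$ and $\Sigma_{t+1}$ by the following update formulas,

$$ \mu_{t+1} = \sum_{i=1}^{n} w_\gamma(x_i, \mu_t, \Sigma_t)\, x_i, \tag{3} $$

$$ \Sigma_{t+1} = (1 + \gamma) \sum_{i=1}^{n} w_\gamma(x_i, \mu_t, \Sigma_t) (x_i - \mu_{t+1})(x_i - \mu_{t+1})^T, \tag{4} $$

where

$$ w_\gamma(x, \mu, \Sigma) = \frac{\exp\left( -\frac{\gamma}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)}{\sum_{j=1}^{n} \exp\left( -\frac{\gamma}{2} (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \right)}. $$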

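Step 3 For a sufficiently small number $\epsilon$, repeat Step 2 while the change from
$(\mu_t, \Sigma_t)$ to $(\mu_{t+1}, \Sigma_{t+1})$ exceeds $\epsilon$, with the change in $\Sigma_t$ measured in the
Frobenius norm $\|\cdot\|_F$.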

If $\gamma = 0$, then the right-hand sides of equations (3) and (4) are equal to the sample mean
vector and the sample covariance matrix, respectively, which are nothing but the MLEs. If
our aim is to obtain the local minimum points of $L_\gamma(\mu)$, then we only have to update
$\mu_t$ and fix $\Sigma_t$ to be the identity matrix $I$. Similarly, if our aim is to obtain the local
minimum points of $L_\gamma(\mu, \Sigma)$ with fixed $\mu$, then we only have to update $\Sigma_t$ and fix $\mu_t = \mu$.
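The iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the stopping rule (sum of the Euclidean change in the mean and the Frobenius change in the covariance), the `max_iter` cap, and the names `gamma_weights`, `gamma_fixed_point`, and `update_sigma` are all hypothetical choices.

```python
import numpy as np

def gamma_weights(x, mu, sigma, gamma):
    """Normalized weights proportional to exp(-gamma/2 * Mahalanobis distance)."""
    diff = x - mu
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    logw = -0.5 * gamma * maha
    w = np.exp(logw - logw.max())  # subtract the max for numerical stability
    return w / w.sum()

def gamma_fixed_point(x, mu0, sigma0, gamma, eps=1e-8, max_iter=1000,
                      update_sigma=True):
    """Iterate the weighted mean and covariance updates until they stabilize."""
    mu = np.asarray(mu0, dtype=float)
    sigma = np.asarray(sigma0, dtype=float)
    for _ in range(max_iter):
        w = gamma_weights(x, mu, sigma, gamma)
        mu_new = w @ x                       # weighted mean update
        if update_sigma:
            diff = x - mu_new
            sigma_new = (1.0 + gamma) * (diff * w[:, None]).T @ diff
        else:
            sigma_new = sigma                # keep the covariance fixed
        # Assumed stopping rule: total change in (mu, Sigma) below eps.
        change = (np.linalg.norm(mu_new - mu)
                  + np.linalg.norm(sigma_new - sigma, "fro"))
        mu, sigma = mu_new, sigma_new
        if change < eps:
            break
    return mu, sigma
```

With `gamma=0.0` the weights are uniform, so the result is the sample mean and the (biased) sample covariance, in line with the remark that the MLE is recovered. Calling the function with `update_sigma=False` and `sigma0=np.eye(p)` corresponds to searching for a local minimum of the identity-covariance loss.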

2.2 Algorithm of the Spontaneous Clustering 

In general, spontaneous clustering based on a density function $f(x, \theta)$ with parameter
$\theta$ is defined as follows.

Spontaneous Clustering

Step 1 Find the local minimum points of $L_\gamma(\theta)$, denoted by $\hat\theta_1, \dots, \hat\theta_K$, where $L_\gamma(\theta)$ is
the $\gamma$-loss function for $f(x, \theta)$.

Step 2 Consider $K$ clusters according to $\hat\theta_1, \dots, \hat\theta_K$, and assign the data to the clusters.

As a special case, spontaneous clustering based on the normal distribution is defined
as follows. We set $\Theta_\mu$ and $\Theta_{(\mu,\Sigma)}$ to be empty sets at the start of the algorithm. The
algorithm of subsection 2.1 is employed in the procedure below.

Spontaneous Clustering Based on the Normal Distribution 



Step 1-1 If $\Theta_\mu$ is the empty set, choose $M$ initial values $x_{(1)}, \dots, x_{(M)}$ in the data
set $\{x_1, \dots, x_n\}$ at random. Otherwise, choose initial values in $\{x_1, \dots, x_n\}$ as
follows: $x_{(1)}, \dots, x_{(M)}$ are the $M$ maximum points of $d(\cdot, \Theta_\mu)$, where

$$ d(x, \Theta_\mu) = \min_{\hat\mu \in \Theta_\mu} \|x - \hat\mu\|. $$

Step 1-2 Apply the algorithm in subsection 2.1 to the data set $M$ times, once for each initial
value $x_{(i)}$, $i = 1, \dots, M$, to find the local minimum points of $L_\gamma(\mu)$. Then add the
obtained local minimum points to $\Theta_\mu$.

Step 1-3 Repeat Steps 1-1 and 1-2 until the number of elements in $\Theta_\mu$ does not increase.

Step 1-4 For each local minimum point $\hat\mu \in \Theta_\mu$, obtain a minimum point of $L_\gamma(\hat\mu, \Sigma)$
with respect to $\Sigma$, denoted by $\hat\Sigma$, with the algorithm in subsection 2.1. Then add
$(\hat\mu, \hat\Sigma)$ to $\Theta_{(\mu,\Sigma)}$.

Step 2 Write $\Theta_{(\mu,\Sigma)}$ as $\{(\hat\mu_k, \hat\Sigma_k)\}_{k=1}^{K}$ and assign each observation $x_i$ to the $\hat k$-th cluster
with

$$ \hat k = \operatorname*{argmin}_{k=1,\dots,K} (x_i - \hat\mu_k)^T \hat\Sigma_k^{-1} (x_i - \hat\mu_k). $$

In the algorithm of spontaneous clustering, we define $(\hat\mu_k, \hat\Sigma_k)$, $k = 1, \dots, K$, as the
centers and the covariance matrices of clusters. In the remainder of this paper, we focus
on spontaneous clustering based on the normal distribution.
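The multi-start search of Steps 1-1 to 1-3 and the assignment of Step 2 can be sketched as below for the simplified case in which every covariance matrix is fixed to the identity (Step 1-4 is omitted). The merge tolerance `merge_tol`, which decides when two converged points count as the same local minimum, is a hypothetical device that the paper does not specify, and `n_starts` plays the role of $M$.

```python
import numpy as np

def find_center(x, start, gamma, eps=1e-8, max_iter=1000):
    """Iterate the weighted-mean update with the covariance fixed to identity."""
    mu = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        logw = -0.5 * gamma * np.sum((x - mu) ** 2, axis=1)
        w = np.exp(logw - logw.max())
        mu_new = (w / w.sum()) @ x
        if np.linalg.norm(mu_new - mu) < eps:
            return mu_new
        mu = mu_new
    return mu

def spontaneous_clustering(x, gamma, n_starts=10, merge_tol=1e-3, seed=0):
    """Return (centers, labels) for identity-covariance spontaneous clustering."""
    rng = np.random.default_rng(seed)
    centers = []
    while True:
        if not centers:
            # Step 1-1, first pass: random data points as initial values.
            idx = rng.choice(len(x), size=min(n_starts, len(x)), replace=False)
        else:
            # Step 1-1, later passes: points farthest from the centers found so far.
            dist = np.linalg.norm(x[:, None, :] - np.array(centers)[None], axis=2)
            idx = np.argsort(dist.min(axis=1))[-n_starts:]
        n_before = len(centers)
        for i in idx:
            # Step 1-2: run the iteration and keep only new local minima.
            mu = find_center(x, x[i], gamma)
            if all(np.linalg.norm(mu - c) > merge_tol for c in centers):
                centers.append(mu)
        if len(centers) == n_before:
            break  # Step 1-3: no new center was found
    centers = np.array(centers)
    # Step 2 with identity covariances: assign to the nearest center.
    dist = np.linalg.norm(x[:, None, :] - centers[None], axis=2)
    return centers, np.argmin(dist, axis=1)
```

The number of clusters is the number of rows of `centers`; it is not passed in, which is the point of the method. Restarting from the points farthest from the current centers is what lets small or remote clusters be discovered after the first random pass.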

2.3 Selection Procedure for $\gamma$

The value of the power index $\gamma$ plays a key role in spontaneous clustering, because $\gamma$
affects the number of clusters obtained. We propose two
methods to select the value of $\gamma$. One is a heuristic choice of $\gamma$ that depends on the
range of the data. Our proposal is $\gamma = 72 / R^2$, where $R$ is defined as the maximum
range:

$$ R = \max_{j=1,\dots,p} \left\{ \left( \max_{i=1,\dots,n} x_{ij} \right) - \left( \min_{i=1,\dots,n} x_{ij} \right) \right\}, $$

where $x_i = (x_{i1}, \dots, x_{ip})^T$. The outline of the derivation of $\gamma$ is as follows. Suppose the
data set is generated from the mixture of two normal distributions centered at $\mu_1$ and $\mu_2$,
each with the identity covariance matrix, with equal mixing proportions. Our
simulation result suggests that if $\|(\mu_1 - \mu_2)/2\| = 3\sqrt{2}/2 \approx 2.12$, then the value of
$\gamma$ needs to be at least 1 for two local minimum points of $L_\gamma(\mu)$ to exist.
Proposition 3.1 tells us that if all the data are multiplied by a scalar $a$ and spontaneous
clustering is applied to the transformed data, then the value of $\gamma$ needs to be at least
$a^{-2}$ to guarantee the existence of two local minimum points of $L_\gamma(\mu)$. If
$\|(\mu_1 - \mu_2)/2\| = r$, then $a = r/(3\sqrt{2}/2)$. Hence we propose to use the value of $\gamma$
defined as

$$ \gamma = \left( \frac{3\sqrt{2}}{2r} \right)^2 = \frac{9}{2r^2}. \tag{5} $$

The value of $r$ can be estimated from the range of the data. Let $R_j$ be the range of the $j$-th
variable. If there are $K$ disjoint clusters lying side by side on a line parallel to the axis
of the $j$-th variable, then we can estimate $r$ by $R_j/(2K)$, as illustrated in Figure
2. There are $p$ variables, so $p$ directions have to be considered simultaneously. We use
the maximum range $R$, and estimate $r$ by $R/(2K)$. The value of $K$ can be determined
from our prior knowledge about the possible number of clusters. If $K = 2$, we have
$\gamma = 72/R^2$. We observe that this rule works well in several empirical studies, although
the argument is not fully supported by theory.
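The heuristic amounts to a few lines of arithmetic: with $r = R/(2K)$ and $\gamma = (3\sqrt{2}/2)^2 / r^2 = 9/(2r^2)$, one obtains $\gamma = 18K^2/R^2$, which is $72/R^2$ for $K = 2$. A minimal Python sketch follows; the function name and the `n_clusters_guess` argument are illustrative assumptions.

```python
import numpy as np

def heuristic_gamma(x, n_clusters_guess=2):
    """Range-based choice of gamma: 9 / (2 r^2) with r = R / (2 K).

    x: (n, p) data matrix; n_clusters_guess: prior guess K for the number
    of clusters lying side by side along the widest variable.
    """
    ranges = x.max(axis=0) - x.min(axis=0)   # per-variable ranges R_j
    max_range = ranges.max()                 # maximum range R
    r = max_range / (2.0 * n_clusters_guess)
    return 9.0 / (2.0 * r ** 2)
```

Because the rule depends on the data only through the maximum range, a single extreme observation can shrink $\gamma$ noticeably, which is worth checking before relying on the heuristic.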
We also propose a more sophisticated method based on AIC. The value of $\gamma$ that
minimizes AIC is recommended as the optimal selection of $\gamma$. Let $K_\gamma$ be the number
of clusters and $(\hat\mu_{\gamma k}, \hat\Sigma_{\gamma k})$, $k = 1, \dots, K_\gamma$, be the centers and the covariance matrices
of clusters resulting from spontaneous clustering. Let $\phi(x, \mu, \Sigma)$ be the density
function of the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Then
$\phi(x, \hat\mu_{\gamma k}, \hat\Sigma_{\gamma k})$ is used as a density estimator of the mixture component $f_k(x)$ in (1). The
result of spontaneous clustering implies the mixture of normal distributions as an
estimator of the density function of the population distribution $g$ in (1),

$$ \hat g_\gamma(x) = \sum_{k=1}^{K_\gamma} \hat\tau_{\gamma k}\, \phi(x, \hat\mu_{\gamma k}, \hat\Sigma_{\gamma k}), $$

where $\hat\tau_{\gamma k}$ is an estimator of the mixing proportion $\tau_k$, defined as the proportion of the
observations assigned to the $k$-th cluster. The AIC based on $\hat g_\gamma$ is defined as follows.

$$ \mathrm{AIC}_\gamma = -2 \sum_{i=1}^{n} \log \hat g_\gamma(x_i) + 2 \left\{ K_\gamma \frac{p(p+3)}{2} + K_\gamma - 1 \right\}. $$

The value of $\gamma$ minimizing $\mathrm{AIC}_\gamma$ is proposed as the optimal selection of $\gamma$.
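A sketch of the AIC computation for one clustering result is given below. It assumes the usual parameter count for a normal mixture: $p$ mean parameters and $p(p+1)/2$ covariance parameters per component, plus $K - 1$ free mixing proportions. The function names and argument layout are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) evaluated at each row of x."""
    p = x.shape[1]
    diff = x - mu
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def aic_from_clustering(x, labels, centers, covariances):
    """AIC of the normal mixture implied by a clustering result.

    labels: (n,) integer cluster indices in 0..K-1;
    centers, covariances: length-K sequences of (p,) and (p, p) arrays.
    """
    n, p = x.shape
    n_clusters = len(centers)
    # Mixing proportions estimated by the share of points in each cluster.
    props = np.array([np.mean(labels == k) for k in range(n_clusters)])
    density = sum(props[k] * np.exp(log_normal_pdf(x, centers[k], covariances[k]))
                  for k in range(n_clusters))
    n_params = n_clusters * (p + p * (p + 1) / 2) + n_clusters - 1
    return -2.0 * np.sum(np.log(density)) + 2.0 * n_params
```

To select $\gamma$, one would run the clustering over a grid of candidate values, evaluate this quantity for each result, and keep the value with the smallest AIC. The grid itself is a user choice.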

3 Behavior of the $\gamma$-loss Function

We provide a justification for spontaneous clustering by exploring its theoretical
aspects. The key fact is that the $\gamma$-loss function $L_\gamma(\mu)$ has $K$ local minimum points if
the data set consists of $K$ cluster groups.

3.1 Nonconvexity 

We consider the reason why the $\gamma$-loss function has local minimum points, as illustrated
in panel (b) of Figure 1. The optimization problem for a nonconvex function that is
expressed as a difference of two convex functions has been considered in Yuille and
Rangarajan (2003) and An and Tao (2005). Effective algorithms such as CCCP and DCA have
been developed. In fact, a monotonic transformation of the $\gamma$-loss function can be
expressed as a difference of two convex functions, and this expression gives the reason
why the $\gamma$-loss function has local minimum points. Rewrite $L_\gamma(\mu)$ as

$$ L_\gamma(\mu) = -\frac{1}{n} \exp\left( -\frac{\gamma}{2} \mu^T \mu \right) \sum_{i=1}^{n} \exp\left( \gamma x_i^T \mu - \frac{\gamma}{2} x_i^T x_i \right). $$

The local minimum points of $L_\gamma(\mu)$ are equal to the local maximum points of
$\Gamma_\gamma(\mu) = \Gamma_\gamma^{(1)}(\mu) - \Gamma_\gamma^{(2)}(\mu)$, where

$$ \Gamma_\gamma^{(1)}(\mu) = \log\left\{ \sum_{i=1}^{n} \exp\left( \gamma x_i^T \mu - \frac{\gamma}{2} x_i^T x_i \right) \right\}, \qquad \Gamma_\gamma^{(2)}(\mu) = \frac{\gamma}{2} \mu^T \mu. $$

Then $\Gamma_\gamma^{(2)}(\mu)$ is obviously a convex function and has a constant Hessian matrix with
positive diagonal elements, which means the surface of $\Gamma_\gamma^{(2)}(\mu)$ is curved. $\Gamma_\gamma^{(1)}(\mu)$ is
also a convex function because its Hessian matrix is given by

$$ \frac{\partial^2 \Gamma_\gamma^{(1)}(\mu)}{\partial\mu\, \partial\mu^T} = \gamma^2 \sum_{i=1}^{n} w_\gamma(x_i, \mu, I) (x_i - \bar x_w)(x_i - \bar x_w)^T, \tag{6} $$

where $\bar x_w = \sum_{i=1}^{n} w_\gamma(x_i, \mu, I) x_i$, and the Hessian matrix is positive semidefinite
(positive definite unless the data lie in a lower-dimensional affine subspace). However,
the Hessian matrix of $\Gamma_\gamma^{(1)}(\mu)$ varies depending on the data and $\mu$, and
becomes close to the zero matrix in a neighborhood where observations are concentrated.
This fact is clear from the form of the Hessian matrix (6) and means the surface
of $\Gamma_\gamma^{(1)}(\mu)$ is almost flat in such a neighborhood. The difference between the flat surface
and the curved surface causes local maximum points of $\Gamma_\gamma(\mu)$. Figure 3 illustrates such
a phenomenon, where the red, green, and blue lines show $\Gamma_\gamma^{(1)}(\mu)$, $\Gamma_\gamma^{(2)}(\mu)$, and $\Gamma_\gamma(\mu)$,
respectively, with dimension $p = 1$ and $\gamma = 3$. The graphs of $\Gamma_\gamma^{(1)}(\mu)$ and $\Gamma_\gamma(\mu)$ are
shifted to take the value 0 at $\mu = 0$.
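The decomposition can be checked numerically. With $\Gamma_\gamma^{(1)}(\mu) = \log\sum_i \exp(\gamma x_i^T\mu - \frac{\gamma}{2}x_i^Tx_i)$ and $\Gamma_\gamma^{(2)}(\mu) = \frac{\gamma}{2}\mu^T\mu$, the difference equals $\log(-n L_\gamma(\mu))$, which is the monotonic transformation referred to above. The sketch below uses hypothetical function names and a small made-up data set.

```python
import numpy as np

def gamma1(mu, x, gamma):
    """Log-sum-exp term: convex in mu."""
    a = gamma * (x @ mu) - 0.5 * gamma * np.sum(x ** 2, axis=1)
    m = a.max()                       # stabilize the log-sum-exp
    return m + np.log(np.sum(np.exp(a - m)))

def gamma2(mu, gamma):
    """Quadratic term: convex in mu with constant Hessian gamma * I."""
    return 0.5 * gamma * (mu @ mu)

def log_neg_n_loss(mu, x, gamma):
    """log(-n * L_gamma(mu)) for the identity-covariance gamma-loss."""
    sq_dist = np.sum((x - mu) ** 2, axis=1)
    return np.log(np.sum(np.exp(-0.5 * gamma * sq_dist)))

# Made-up example values for a quick consistency check.
x_demo = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
mu_demo = np.array([0.3, -0.2])
difference = gamma1(mu_demo, x_demo, 3.0) - gamma2(mu_demo, 3.0)
direct = log_neg_n_loss(mu_demo, x_demo, 3.0)
```

Both quantities agree up to rounding, so maximizing the difference of the two convex functions is the same problem as minimizing the $\gamma$-loss.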
3.2 Existence of Local Minimum Points

We consider a condition for the existence of local minimum points of $L_\gamma(\mu)$. As
discussed in subsection 2.2, the local minimum points of $L_\gamma(\mu)$ are defined as the
centers of clusters, so it is important to know when the $\gamma$-loss function has local minimum
points.

To simplify the argument, we assume that the data set is generated from the mixture
of two normal distributions with covariance matrix $\sigma^2 I$,

$$ g(x) = \tau_1 \phi(x, \mu_1, \sigma^2 I) + \tau_2 \phi(x, \mu_2, \sigma^2 I), \qquad \tau_1 + \tau_2 = 1, \quad \tau_k > 0, \quad k = 1, 2. $$

For ease of calculation, we consider $n = \infty$. As $n$ tends to $\infty$, $L_\gamma(\mu)$ almost surely
converges to the $\gamma$-cross entropy defined by

$$ C_\gamma(g, \phi(\cdot, \mu, I)) = -\int g(x)\, \phi(x, \mu, I)^\gamma \, dx. \tag{7} $$

See appendix B for a detailed discussion of the $\gamma$-cross entropy. $C_\gamma(g, \phi(\cdot, \mu, I))$
becomes

$$ C_\gamma(g, \phi(\cdot, \mu, I)) = -\sum_{k=1,2} \tau_k \int \phi(x, \mu_k, \sigma^2 I)\, \phi(x, \mu, I)^\gamma \, dx \propto -\sum_{k=1,2} \tau_k\, \phi\left(\mu, \mu_k, (\sigma^2 + 1/\gamma) I\right), $$

which is nothing but the negative of the density function of the mixture of two normal
distributions with the same covariance matrix $(\sigma^2 + 1/\gamma) I$. Hence the local minimum points of
$C_\gamma(g, \phi(\cdot, \mu, I))$ are equal to the modes of the density function of the normal mixture.
Figure 4 shows $-C_\gamma(g, \phi(\cdot, \mu, I))$ with dimension $p = 2$, where $-C_\gamma(g, \phi(\cdot, \mu, I))$ has
one or two modes depending on the values of $\mu_1$, $\mu_2$, $\tau_1$, $\tau_2$, and $\gamma$. For the univariate
case, a necessary and sufficient condition for the density function of the mixture of two
normal distributions to be bimodal is given in de Helguero (1904). We use a technique
similar to that of de Helguero (1904) to obtain a necessary and sufficient condition for
$C_\gamma(g, \phi(\cdot, \mu, I))$ to have two local minimum points.

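Proposition 3.1 Let $\nu = (\mu_1 - \mu_2)/2$ and $d = \|\nu\|^2 - (\sigma^2 + 1/\gamma)$. Then $C_\gamma(g, \phi(\cdot, \mu, I))$
has two local minimum points if and only if the following three conditions hold:

$$ d > 0, \tag{8} $$

$$ \exp\left( \frac{2\gamma \|\nu\| \sqrt{d}}{1 + \gamma\sigma^2} \right) > \frac{\tau_1}{\tau_2} \cdot \frac{\gamma}{1 + \gamma\sigma^2} \left( \|\nu\| + \sqrt{d} \right)^2, \tag{9} $$

$$ \exp\left( -\frac{2\gamma \|\nu\| \sqrt{d}}{1 + \gamma\sigma^2} \right) < \frac{\tau_1}{\tau_2} \cdot \frac{\gamma}{1 + \gamma\sigma^2} \left( \|\nu\| - \sqrt{d} \right)^2. \tag{10} $$

In particular, if $\tau_1 = \tau_2$, then (9) and (10) hold for any $d > 0$. When the two local
minimum points exist, they lie on the segment between $\mu_1$ and $\mu_2$. The one closer to $\mu_1$ and
the one closer to $\mu_2$ are denoted by $\mu_1^*$ and $\mu_2^*$, respectively. Then $\|\mu_1 - \mu_1^*\|$ and
$\|\mu_2 - \mu_2^*\|$ are bounded above by

$$ \|\nu\| - \sqrt{\|\nu\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)}. $$

By proposition 3.1, for any $\sigma^2$, if $\mu_1$ and $\mu_2$ are sufficiently far apart, then there exists $\gamma$
that guarantees the existence of two local minimum points of $C_\gamma(g, \phi(\cdot, \mu, I))$, and two
clusters are defined at the same time. In addition, the center of a cluster $\mu_k^*$ becomes
arbitrarily close to $\mu_k$ ($k = 1, 2$) when $\|\mu_1 - \mu_2\|$ becomes large.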
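The conditions of the proposition can be compared with a brute-force count of modes. The sketch below evaluates the three conditions through the function $h(t) = -4Ct + \log(1+t) - \log(1-t) - \log(\tau_1/\tau_2)$ and the constants $C$ and $D$ used in appendix A, and separately counts the modes of the limiting mixture along the line through the two means. Function names, the grid size, and the grid extent are illustrative assumptions.

```python
import numpy as np

def two_minima_condition(nu_norm, sigma2, gamma, tau1):
    """Check d > 0, h(-D) > 0 and h(D) < 0 for the two-component mixture."""
    tau2 = 1.0 - tau1
    d = nu_norm ** 2 - (sigma2 + 1.0 / gamma)
    if d <= 0:
        return False
    c = nu_norm ** 2 * gamma / (2.0 * (1.0 + sigma2 * gamma))
    big_d = np.sqrt(1.0 - 1.0 / (2.0 * c))

    def h(t):
        return -4.0 * c * t + np.log(1 + t) - np.log(1 - t) - np.log(tau1 / tau2)

    return bool(h(-big_d) > 0 and h(big_d) < 0)

def count_modes_on_line(nu_norm, sigma2, gamma, tau1, n_grid=20001):
    """Count strict local maxima of the mixture with variance sigma2 + 1/gamma."""
    s = sigma2 + 1.0 / gamma
    t = np.linspace(-1.5, 1.5, n_grid) * nu_norm  # positions along the line
    dens = (tau1 * np.exp(-(t - nu_norm) ** 2 / (2.0 * s))
            + (1.0 - tau1) * np.exp(-(t + nu_norm) ** 2 / (2.0 * s)))
    return int(np.sum((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])))
```

For equal mixing proportions the check reduces to $d > 0$, so increasing $\gamma$ (which shrinks $1/\gamma$) can turn a unimodal limit into a bimodal one only when $\|\nu\|^2 > \sigma^2$.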

4 Simulation 

The performance of spontaneous clustering was investigated through Monte Carlo
experiments. It was also compared with the $K$-means algorithm
and model-based clustering (MBC).

4.1 Case of Spherical Clusters

We demonstrate the performance of spontaneous clustering in comparison with the
$K$-means algorithm. In this simulation, it is supposed that the covariance matrices
of clusters are known to be the identity matrix. The value of $\gamma$ for spontaneous
clustering is determined by the two methods described in subsection 2.3. The number
of clusters for the $K$-means algorithm is determined by the two methods described below.
The performance of clustering is measured by the BHI, defined later.



For the $K$-means algorithm, the method of Caliński and Harabasz (1974) and the
gap statistic of Tibshirani et al. (2001) were used to fix the number of clusters. Let
$B(k)$ and $W(k)$ be the between- and within-cluster sums of squares with $k$ clusters.
Caliński and Harabasz (1974) propose to select the number of clusters $k$ that maximizes
$\mathrm{CH}(k)$, where $\mathrm{CH}(k)$ is defined as

$$ \mathrm{CH}(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}. $$

On the other hand, Tibshirani et al. (2001) propose to choose the value of $k$ that
maximizes $\mathrm{Gap}_n(k) = E_n^*(\log(W_k)) - \log(W_k)$, where $E_n^*$ denotes expectation under
a sample of size $n$ from the reference distribution.
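For concreteness, the CH index for a given partition can be computed as in the following sketch, where $B(k)$ is the size-weighted sum of squared distances between cluster means and the overall mean and $W(k)$ is the sum of squared distances of points to their own cluster mean. The function name is an illustrative choice.

```python
import numpy as np

def calinski_harabasz(x, labels):
    """CH index (B/(k-1)) / (W/(n-k)) for data x (n, p) and integer labels (n,)."""
    n = len(x)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = x.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in cluster_ids:
        members = x[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```

The index is undefined for $k = 1$ and for partitions with zero within-cluster variation, so in practice it is evaluated for $k \ge 2$ only.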

A sample of size 200 is generated from the mixture of five standard normal distributions
centered at $(0, 0)^T$, $(3, 3)^T$, $(-3, 3)^T$, $(-3, -3)^T$, and $(3, -3)^T$ with equal mixing
proportions. Figure 5 displays an example sample. We simulated 100 runs and compared
the clustering results of spontaneous clustering with those of the $K$-means
algorithm. Figure 6 shows the value of AIC and the number of clusters resulting from
spontaneous clustering for the sample in Figure 5. The selected value of $\gamma$ based on
AIC is 0.7.

Table 1 displays the frequency of choosing $K$ clusters for each of the methods for
different values of $K$. All methods except the $K$-means algorithm with Gap chose the
true number of clusters in almost every simulation run. To measure the performance
of the clustering, we used the Biological Homogeneity Index (BHI) (Wu, 2011), which
measures the homogeneity between the clusters $C = \{C_1, \dots, C_K\}$ and the biological
categories or subtypes $B = \{B_1, \dots, B_L\}$,

$$ \mathrm{BHI}(C, B) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{i \neq j \in C_k} I\left( B^{(i)} = B^{(j)} \right), \tag{11} $$

where $B^{(i)} \in B$ is the subtype for the observation $x_i$ and $n_k$ is the number of
observations in $C_k$. This index is bounded above by 1, which means perfect homogeneity
between the clusters and the biological categories. The mean value of BHI over
100 simulation runs for each method is shown in Table 2. All methods except the
$K$-means algorithm with Gap give good clustering results. In every simulation run in which
a method detected five clusters for a sample, we calculated the Euclidean distance
between the center of each cluster and the mean vector of the corresponding normal
component of the normal mixture. The mean value of the distance is also shown in Table
2, where DM1, ..., DM5 represent the mean value for clusters 1, ..., 5, respectively. In
this simulation setting, the centers obtained by spontaneous clustering vary more
than those obtained by the $K$-means algorithm.
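The BHI can be computed directly from two label vectors: for each cluster, it is the fraction of ordered pairs of distinct members that share the same class label, averaged over clusters. The sketch below is illustrative. Skipping singleton clusters, for which the within-cluster fraction is 0/0, is an assumption made here and is not taken from the paper.

```python
import numpy as np

def bhi(cluster_labels, class_labels):
    """Biological Homogeneity Index of a clustering against known classes."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    scores = []
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        n_k = len(members)
        if n_k < 2:
            continue  # assumption: singleton clusters are left out of the average
        # Ordered pairs (i, j), i != j, whose class labels agree.
        same = np.sum(members[:, None] == members[None, :]) - n_k
        scores.append(same / (n_k * (n_k - 1)))
    return float(np.mean(scores))
```

A value of 1 means every cluster contains a single class. The index does not penalize splitting one class across several clusters, so it is best read together with the detected number of clusters.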

To summarize, this simulation example shows that spontaneous clustering with
the range and AIC has almost the same performance as the $K$-means algorithm with
CH, and better performance than the $K$-means algorithm with Gap.
4.2 Case of Ellipsoidal Clusters

We demonstrate the performance of spontaneous clustering in comparison with
MBC, in which the component density is normal. It is supposed that the covariance
matrices of clusters are heterogeneous and unknown. The value of $\gamma$ for spontaneous
clustering and the number of clusters for MBC are determined based on AIC.

A sample of size 100 is generated from the mixture of two bivariate normal
distributions with mean vectors $(0, 0)^T$ and $(3, 3)^T$, and covariance matrices

$$ \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 2 & -0.5 \\ -0.5 & 2 \end{pmatrix}. $$



Figure 7 displays an example sample, and Figure 8 shows the value of AIC and the
number of clusters resulting from spontaneous clustering for the sample. Note that
we use two values $\gamma_1$ and $\gamma_2$ of the power index $\gamma$: $\gamma_1$ is used for $L_\gamma(\mu)$ when defining
the centers of clusters, and $\gamma_2$ for $L_\gamma(\mu, \Sigma)$ when defining the covariance matrices. The
selected values for the sample in Figure 7 are $\gamma_1 = 0.25$ and $\gamma_2 = 0.7$. We
simulated 100 runs and compared the clustering results of spontaneous clustering
with those of MBC.

Table 3 displays the frequency of choosing $K$ clusters for each of the clustering
algorithms for different values of $K$. Spontaneous clustering chose the true number
of clusters, while MBC selected a larger number of clusters, between 3 and 10, in 39 runs.
The mean value of BHI is shown in Table 4. Both clustering algorithms show good
performance. In every simulation run in which a clustering method detected two clusters
for a sample, two measures were calculated. One is the Euclidean distance between
the center of a cluster and the mean vector of the corresponding normal component
of the normal mixture. The other is the Frobenius norm of the covariance matrix of
a cluster minus that of the corresponding normal component. The mean values of the
Euclidean distance and the Frobenius norm are shown in Table 4, where DV1 and DV2
represent the mean value of the Frobenius norm for clusters 1 and 2, respectively. In
this simulation setting, similar to the simulation result in subsection 4.1, the centers and
the covariance matrices obtained by spontaneous clustering vary more than those
obtained by MBC.

To summarize, this simulation example shows that spontaneous clustering with
AIC has almost the same performance as MBC with AIC.



5 Data Analysis 



To evaluate the practical performance of spontaneous clustering, we applied it, with
the covariance matrix fixed to the identity, to real data, together with the $K$-means
algorithm. The data set consists of the chemical composition of 45 specimens of
Romano-British pottery, determined by atomic absorption spectrophotometry, for nine
oxides (Tubb et al., 1980). Figure 9 shows the scatterplot matrix of the Romano-British
pottery data. In addition to the chemical composition of the specimens, the kiln site at
which each specimen was found is known. There are five kiln sites, located in three different
regions, so we use the three regions as class labels. Our aim is to partition the 45
specimens into clusters corresponding to the three classes by using only information
about the chemical composition, without knowledge of the class labels. The value of
$\gamma$ for spontaneous clustering is determined by the two methods, based on the range
of the data and on AIC, respectively. The number of clusters for the $K$-means algorithm is
determined by CH and Gap.

Table 5 shows the result of spontaneous clustering. The value of AIC and the
number of clusters are shown in panel (a) of Figure 10. With the values of $\gamma$ selected
by the range and by AIC, spontaneous clustering detects three clusters corresponding
to the three regions. In particular, the clustering result for the heuristic choice of $\gamma$ is the
most accurate. The scatterplot of the Al2O3 variable suggests that the number of clusters is
two, and the maximum range is attained by this variable. This matches the
scenario discussed in the derivation of the heuristic method, in which we assume the
number of clusters is two. The values of CH and Gap are shown in panels (b) and (c)
of Figure 10. They increase almost monotonically as the number of clusters increases,
so CH and Gap do not work well for this data set. As a result, we observe that spontaneous
clustering based on the range and AIC can detect three clusters properly and partition
the 45 specimens into clusters corresponding to the three regions.

6 Discussion 

We proposed a new clustering algorithm based on the local minimization of the
$\gamma$-loss function, which we named spontaneous clustering. In spontaneous clustering,
the local minimum points of the $\gamma$-loss function are defined as the centers and
covariance matrices of clusters. A large majority of statistical methods use the global
minimum or maximum point of objective functions and try to avoid local minimum
or maximum points. The convexity of the objective functions plays an important role
in statistics. For example, the support vector machine has a convex loss function, and an
efficient algorithm to obtain the global minimum point is built on this
convexity (Bishop, 2006). Although nonconvexity is generally intractable, spontaneous
clustering benefits from the nonconvexity, which makes our method unique and
interesting. The idea of using local minimum points of the $\gamma$-loss function can be applied to
other statistical methods. For example, the idea is applied to principal component
analysis (Mollah et al., 2010) and to estimation of the Gaussian copula parameter
(Notsu et al., 2012).



Spontaneous clustering does not require information about the number of
clusters a priori and can find it automatically if the value of the power index $\gamma$ is properly
fixed. In contrast, existing methods such as $K$-means and model-based clustering
require the number of clusters. Instead of the number of clusters, the value of $\gamma$ has to
be determined in spontaneous clustering. Two methods to determine the value of $\gamma$
are proposed in this paper. One is a heuristic method that depends on the range of the
data. Our simulation study shows that it performs well in many situations,
so we can usually use this heuristic method. A more sophisticated choice based on AIC
is also proposed, although it requires much more computational effort. At the beginning of
the research on the selection of $\gamma$, we considered cross validation, which is one of
the common procedures to select the optimal value of a tuning parameter (Hastie et al.,
2009). In Mollah et al. (2010) a method using cross validation is proposed for
the selection of $\gamma$. However, the method does not work well for spontaneous clustering.
Hence we employ AIC for the selection of $\gamma$. The simulation study and the real data
analysis demonstrate that our proposal works well.
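A Proof of Proposition 3.1

No generality is lost by assuming $\mu_2 = -\mu_1$. The gradient of $C_\gamma(g, \phi(\cdot, \mu, I))$ is given
by

$$ \frac{\partial}{\partial\mu} C_\gamma(g, \phi(\cdot, \mu, I)) \propto \tau_1 \phi(\mu, \mu_1, (\sigma^2 + 1/\gamma) I)(\mu - \mu_1) + \tau_2 \phi(\mu, -\mu_1, (\sigma^2 + 1/\gamma) I)(\mu + \mu_1). \tag{12} $$

From (12), every local minimum point of $C_\gamma(g, \phi(\cdot, \mu, I))$ should lie on the segment
between $-\mu_1$ and $\mu_1$. The Hessian matrix of $C_\gamma(g, \phi(\cdot, \mu, I))$ is given by

$$ \begin{aligned} \frac{\partial^2}{\partial\mu\, \partial\mu^T} C_\gamma(g, \phi(\cdot, \mu, I)) \propto {} & -\tau_1 \phi(\mu, \mu_1, (\sigma^2 + 1/\gamma) I) \frac{\gamma}{1 + \sigma^2\gamma} (\mu - \mu_1)(\mu - \mu_1)^T \\ & - \tau_2 \phi(\mu, -\mu_1, (\sigma^2 + 1/\gamma) I) \frac{\gamma}{1 + \sigma^2\gamma} (\mu + \mu_1)(\mu + \mu_1)^T \\ & + \tau_1 \phi(\mu, \mu_1, (\sigma^2 + 1/\gamma) I)\, I \\ & + \tau_2 \phi(\mu, -\mu_1, (\sigma^2 + 1/\gamma) I)\, I. \end{aligned} \tag{13} $$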

Let $\mu(t) = t\mu_1$. From (13), $\mu(t)$ is a local minimum point of $C_\gamma(g, \phi(\cdot, \mu, I))$ if and only
if $t$ is a local minimum point of $C_\gamma(g, \phi(\cdot, \mu(t), I))$ with respect to $t$. $C_\gamma(g, \phi(\cdot, \mu(t), I))$
becomes

$$ C_\gamma(g, \phi(\cdot, \mu(t), I)) \propto -\tau_1 \exp(-C(t - 1)^2) - \tau_2 \exp(-C(t + 1)^2), $$

where $C$ is equal to $\|\mu_1\|^2 \gamma / (2(1 + \sigma^2\gamma))$. The derivative of $C_\gamma(g, \phi(\cdot, \mu(t), I))$ is given
by

$$ \frac{d}{dt} C_\gamma(g, \phi(\cdot, \mu(t), I)) \propto \tau_1 \exp(-C(t - 1)^2)(t - 1) + \tau_2 \exp(-C(t + 1)^2)(t + 1). $$

It suffices to consider $-1 < t < 1$. Then

$$ \begin{aligned} \frac{d}{dt} C_\gamma(g, \phi(\cdot, \mu(t), I)) > 0 &\iff \exp\left(-C(t + 1)^2 + C(t - 1)^2\right) > \frac{(1 - t)\tau_1}{(t + 1)\tau_2} \\ &\iff -4Ct + \log(t + 1) - \log(1 - t) - \log\frac{\tau_1}{\tau_2} > 0. \end{aligned} \tag{14} $$

Let $h(t)$ be the left-hand side of inequality (14). The derivative of $h(t)$ is given by

$$ h'(t) = -4C + \frac{1}{t + 1} + \frac{1}{1 - t}, $$

and

$$ h'(t) > 0 \iff -4C(1 - t^2) + (1 - t) + (1 + t) > 0 \iff t^2 - \left( 1 - \frac{1}{2C} \right) > 0. $$

If $1 - 1/(2C) \le 0$, then $h'(t) \ge 0$, and $C_\gamma(g, \phi(\cdot, \mu(t), I))$ has one local minimum
point. Hence $C_\gamma(g, \phi(\cdot, \mu(t), I))$ has two local minimum points if and only if

$$ 1 - \frac{1}{2C} > 0, \qquad h(-D) > 0, \qquad h(D) < 0, $$

where $D$ is the positive solution of the equation $h'(t) = 0$, that is, $D = \sqrt{1 - 1/(2C)}$.
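Condition $1 - 1/(2C) > 0$ is equivalent to $\|\mu_1\|^2 - (\sigma^2 + 1/\gamma) > 0$. Condition
$h(-D) > 0$ is equivalent to

$$ \exp\left( \frac{2\gamma}{1 + \sigma^2\gamma} \|\mu_1\| \sqrt{\|\mu_1\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)} \right) > \frac{\tau_1}{\tau_2} \cdot \frac{\gamma}{1 + \sigma^2\gamma} \left( \|\mu_1\| + \sqrt{\|\mu_1\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)} \right)^2, $$

and condition $h(D) < 0$ is equivalent to

$$ \exp\left( -\frac{2\gamma}{1 + \sigma^2\gamma} \|\mu_1\| \sqrt{\|\mu_1\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)} \right) < \frac{\tau_1}{\tau_2} \cdot \frac{\gamma}{1 + \sigma^2\gamma} \left( \|\mu_1\| - \sqrt{\|\mu_1\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)} \right)^2. $$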
Note that $\mu_1^*$ is on the segment between $D\mu_1$ and $\mu_1$. Similarly, $(-\mu_1)^*$ is on the
segment between $-\mu_1$ and $-D\mu_1$. Then

$$ \|\mu_1^* - \mu_1\| < (1 - D)\|\mu_1\| = \|\mu_1\| - \sqrt{\|\mu_1\|^2 - \left( \sigma^2 + \frac{1}{\gamma} \right)}. $$

If $\tau_1 = \tau_2$, then $h(\pm 1) = \pm\infty$ and $h(0) = 0$. Condition $1 - 1/(2C) > 0$ is equivalent
to $h'(0) < 0$. Hence the two conditions $h(-D) > 0$ and $h(D) < 0$ hold whenever condition
$1 - 1/(2C) > 0$ holds. $\square$

B $\gamma$-divergence and $\gamma$-loss Function

The aim of this section is to give a general introduction to the $\gamma$-divergence and the
$\gamma$-loss function. A more detailed discussion can be found in Eguchi and Kato (2010).



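B.1 $\gamma$-divergence

Suppose a random sample is generated from a population distribution with density
function $g$. Let $\{f(\cdot, \theta)\}$ be a family of density functions indexed by parameter $\theta$. The
$\gamma$-cross entropy between $g$ and $f(\cdot, \theta)$ is defined as

$$ C_\gamma(g, f(\cdot, \theta)) = -\kappa_\gamma(\theta) \int g(x) f(x, \theta)^\gamma \, dx, $$

with power index $\gamma > 0$, where $\kappa_\gamma(\theta)$ is the normalizing constant defined as

$$ \kappa_\gamma(\theta) = \left( \int f(x, \theta)^{1+\gamma} \, dx \right)^{-\frac{\gamma}{1+\gamma}}. $$

The Boltzmann-Shannon cross entropy between $g$ and $f(\cdot, \theta)$ is defined by

$$ -\int g(x) \log f(x, \theta) \, dx. $$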
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The 7-cross entropy and the Bokzmann-Shannon cross entropy have the following re- 
lation since k^{9) converges to 1 if 7 tends to 0. 



lim —^ — ^ ^ = — I qix) lim ( ''- dx 

7^0 7 J 7-^0 V 7 



g{x)\ogf{x,e)dx. 

Hence the Boltzmann-Shannon cross entropy can be seen as the 0-cross entropy, and
the γ-cross entropy can be regarded as an extension of the Boltzmann-Shannon cross
entropy. The γ-entropy of $g$ is defined as $H_\gamma(g) = C_\gamma(g, g)$; the γ-divergence between
$g$ and $f(\cdot,\theta)$ is defined as

$$D_\gamma(g, f(\cdot,\theta)) = C_\gamma(g, f(\cdot,\theta)) - H_\gamma(g).$$

Note that the γ-divergence $D_\gamma(g, f(\cdot,\theta))$ is nonnegative, and $D_\gamma(g, f(\cdot,\theta))$ is equal to
0 if and only if $\theta$ satisfies $g(x) = f(x,\theta)$ for almost every $x$. From these properties,
$D_\gamma(g, f(\cdot,\theta))$ can be seen as a kind of distance between $g$ and $f(\cdot,\theta)$, although it is
not symmetric. When our aim is to find the distribution closest to $g$ in the model
$\{f(\cdot,\theta)\}$ with respect to the γ-divergence, we only have to find the global minimum
point of $D_\gamma(g, f(\cdot,\theta))$ with respect to $\theta$, which is equal to that of $C_\gamma(g, f(\cdot,\theta))$.
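The properties stated above (nonnegativity, the value 0 when the two densities coincide, and the lack of symmetry) can be checked numerically. The sketch below is a minimal example under assumed settings: normal densities, $\gamma = 1$, and a grid-based integral. The function names are hypothetical.

```python
import numpy as np

# Integration grid (assumption: tails outside [-30, 30] are negligible).
x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def gamma_cross_entropy(g, f, gamma):
    """C_gamma(g, f) = -kappa_gamma * int g f^gamma dx, on the grid."""
    kappa = (np.sum(f ** (1.0 + gamma)) * dx) ** (-gamma / (1.0 + gamma))
    return float(-kappa * np.sum(g * f ** gamma) * dx)

def gamma_divergence(g, f, gamma):
    """D_gamma(g, f) = C_gamma(g, f) - H_gamma(g), with H_gamma(g) = C_gamma(g, g)."""
    return gamma_cross_entropy(g, f, gamma) - gamma_cross_entropy(g, g, gamma)

gamma = 1.0
g = normal_pdf(x, 0.0, 1.0)  # hypothetical population density

# The divergence over a few model means: zero at the true mean, positive elsewhere.
divergences = []
for mean in (-2.0, -1.0, 0.0, 1.0, 2.0):
    d = gamma_divergence(g, normal_pdf(x, mean, 1.0), gamma)
    divergences.append(d)
    print(f"mean={mean:+.1f}  D_gamma={d:.6f}")

# Swapping the arguments changes the value, so D_gamma is not symmetric.
wide = normal_pdf(x, 0.0, 2.0)
d_forward = gamma_divergence(g, wide, gamma)
d_backward = gamma_divergence(wide, g, gamma)
print(f"D_gamma(g, wide)={d_forward:.6f}  D_gamma(wide, g)={d_backward:.6f}")
```

In this setting the minimum over the model mean is attained where the model coincides with $g$, which is the sense in which minimizing $D_\gamma$ and minimizing $C_\gamma$ over $\theta$ are equivalent.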

B.2 γ-loss Function

The γ-loss function is defined by an estimator of the γ-cross entropy. Let $\{x_1, x_2,
\ldots, x_n\}$ be a random sample generated from a population distribution with density
function $g$ and $\{f(\cdot,\theta)\}$ be our statistical model. The γ-loss function for $f(\cdot,\theta)$ associated
with the γ-divergence is given by

$$L_\gamma(\theta) = -\kappa_\gamma(\theta) \, \frac{1}{n} \sum_{i=1}^{n} f(x_i,\theta)^{\gamma}.$$

We extend the definition of the γ-cross entropy to any distribution. For any distribution
function $G$, the γ-cross entropy between $G$ and $f(\cdot,\theta)$ is defined as

$$C_\gamma(G, f(\cdot,\theta)) = -\kappa_\gamma(\theta) \int f(x,\theta)^{\gamma} \, dG(x).$$
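To make the γ-loss function concrete, the following sketch evaluates it for a one-dimensional normal location model with fixed unit variance and locates its local minimum points on a grid. The data-generating setting (an equal-weight mixture of normals centered at 0 and 10, sample size 200, $\gamma = 1$), the grid, and the function name `gamma_loss` are assumptions for illustration. The closed form used for $\int f(x,\theta)^{1+\gamma} dx$ holds for the normal density only.

```python
import numpy as np

def gamma_loss(mu, sample, gamma, sd=1.0):
    """gamma-loss L_gamma(mu) for the model N(mu, sd^2) on a 1-D sample."""
    # For a normal density, int f^(1+gamma) dx = (2*pi*sd^2)^(-gamma/2) / sqrt(1+gamma).
    integral = (2.0 * np.pi * sd**2) ** (-gamma / 2.0) / np.sqrt(1.0 + gamma)
    kappa = integral ** (-gamma / (1.0 + gamma))
    density = np.exp(-0.5 * ((sample - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    # Average of f(x_i, mu)^gamma, i.e. the integral against the empirical distribution.
    return float(-kappa * np.mean(density ** gamma))

# Hypothetical sample: equal-weight mixture of N(0, 1) and N(10, 1).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(10.0, 1.0, 100)])

grid = np.linspace(-5.0, 15.0, 2001)
loss = np.array([gamma_loss(mu, sample, gamma=1.0) for mu in grid])

# Interior grid points that are strictly lower than both neighbours.
is_local_min = (loss[1:-1] < loss[:-2]) & (loss[1:-1] < loss[2:])
local_minima = grid[1:-1][is_local_min]
print("local minimum points of the gamma-loss:", local_minima)
```

A grid search is used here only because it is easy to verify; it is not meant to represent the optimization procedure used in the paper.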

Note that $L_\gamma(\theta)$ equals $C_\gamma(G, f(\cdot,\theta))$ when $G$ is the empirical distribution function, so that
$E(L_\gamma(\theta)) = C_\gamma(g, f(\cdot,\theta))$, and $L_\gamma(\theta)$ almost surely converges to $C_\gamma(g, f(\cdot,\theta))$. The
γ-estimator of $\theta$ is defined by the global minimum point of $L_\gamma(\theta)$ (Eguchi and Kato,
2010). From the definition of the γ-estimator, it satisfies Fisher consistency. If the
density function $g$ belongs to the statistical model $\{f(\cdot,\theta)\}$, then the γ-estimator satisfies
asymptotic consistency and normality. The γ-loss function and the log likelihood
function satisfy the following relation:

$$\lim_{\gamma \to 0} \frac{L_\gamma(\theta) + 1}{\gamma} = -\frac{1}{n} \sum_{i=1}^{n} \log f(x_i,\theta).$$

Hence the MLE can be regarded as the 0-estimator and the γ-estimator can be seen as
an extension of the MLE.
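The relation between the γ-loss function and the log likelihood function can be checked numerically. The sketch below assumes a normal location model with unit variance, a simulated sample, and a fixed parameter value; these choices and the function names are hypothetical and serve only to illustrate the limit.

```python
import numpy as np

def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2) evaluated at x."""
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2.0 * np.pi))

def gamma_loss(mu, sample, gamma, sd=1.0):
    """gamma-loss L_gamma(mu) for the model N(mu, sd^2)."""
    # log of int f^(1+gamma) dx for a normal density (closed form).
    log_integral = -0.5 * gamma * np.log(2.0 * np.pi * sd**2) - 0.5 * np.log1p(gamma)
    kappa = np.exp(-gamma / (1.0 + gamma) * log_integral)
    return float(-kappa * np.mean(np.exp(gamma * normal_logpdf(sample, mu, sd))))

# Hypothetical sample and parameter value.
rng = np.random.default_rng(1)
sample = rng.normal(2.0, 1.0, 50)
mu = 1.5

# Negative mean log likelihood, the right-hand side of the relation.
neg_mean_loglik = float(-np.mean(normal_logpdf(sample, mu, 1.0)))

for gamma in (1.0, 0.1, 0.01, 0.001):
    scaled = (gamma_loss(mu, sample, gamma) + 1.0) / gamma
    print(f"gamma={gamma:<6} (L_gamma + 1)/gamma = {scaled:.6f}   "
          f"-mean log f = {neg_mean_loglik:.6f}")
```

As $\gamma$ decreases, the scaled γ-loss should approach the negative mean log likelihood, which is consistent with viewing the MLE as the 0-estimator.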
Figure 1: (a) Log likelihood function, (b) Minus γ-loss function ($\gamma = 1$). In panels
(a) and (b) the data of size 200 are generated from the mixture of two standard normal
distributions centered at 0 and 10, respectively.
Figure 2: Example data generated from the mixture of two normal distributions centered
at $(0, 0)^T$ and $(5, 0)^T$ with the identity covariance matrix, respectively.
Figure 3: Visualization of T\ (/i), Tif (/i), and T^{jj). In panel (a) the sample of size
100 is generated from the normal mixture $0.5\phi(x, -2, 0.04) + 0.5\phi(x, 2, 0.04)$. In panel
(b) the sample of size 200 is generated from the normal mixture $0.25\phi(x, -5.5, 0.04) +
0.25\phi(x, -2, 0.04) + 0.25\phi(x, 2, 0.04) + 0.25\phi(x, 5.5, 0.04)$.

Figure 4: Illustration of $-C_\gamma(G, \phi(\cdot, \mu, I))$. In panel (a) $\mu_1 = (0, 0)^T$, $\mu_2 =
(2, 2)^T$, $\tau_1 = \tau_2 = 0.5$, $\gamma = 1$, $\sigma^2 = 1$. In panel (b) $\mu_1 = (0, 0)^T$, $\mu_2 = (4, 4)^T$, $\tau_1 =
\tau_2 = 0.5$, $\gamma = 1$, $\sigma^2 = 1$.
Figure 5: (a) Five clusters, (b) Same as (a) but colored according to cluster.
Figure 6: Value of AIC and number of clusters.
Table 1: Frequencies of Choosing K Clusters.

K                                        12 3 4
Spontaneous clustering with the range    9 91
Spontaneous clustering with AIC          1 99
K-means with CH                          100
K-means with Gap                         91 7



Table 2: Mean Value of BHI and DM1-DM5.

                                         BHI    DM1    DM2    DM3    DM4    DM5
Spontaneous clustering with the range    0.93   0.38   0.38   0.37   0.33   0.34
Spontaneous clustering with AIC          0.94   0.34   0.32   0.28   0.27   0.26
K-means with CH                          0.95   0.25   0.23   0.21   0.21   0.21
K-means with Gap                         0.22   0.16   0.49   0.23   0.41   0.21
Figure 7: (a) Two clusters, (b) Same as (a) but colored according to cluster.
Figure 8: (a) Value of AIC. (b) Number of clusters.



Table 3: Frequencies of Choosing K Clusters.

K                         1 2 3 4 5 6 7 8 9 10
Spontaneous clustering    01 00 0000000
MBC                       61 13 344345 3



Table 4: Mean Value of BHI and DM1, DM2, DV1, and DV2.

                          BHI    DM1    DM2    DV1    DV2
Spontaneous clustering    1.00   0.12   0.20   0.33   0.58
MBC                       0.99   0.10   0.16   0.22   0.48
Figure 9: Scatterplot matrix of data on Romano-British pottery. The red, blue, and
green circles correspond to the three regions.



Table 5: Result of the Spontaneous Clustering.

Method    γ       Number of clusters    BHI
Range     0.63    3                     1
AIC       0.35    3                     0.96
Figure 10: (a) AIC and number of clusters, (b) CH. (c) Gap.